ABSTRACT

impact of feedback about consequences for the 1998 achievement levels-setting
(ALS) process for the National Assessment of Educational Progress (NAEP). The
field trials provided the opportunity to try out different methods similar to 
those used successfully by others, as well as to try out some new methods. 

The American College Testing program (ACT) had proposed a new method to be 
tested in the field trials. Although successful implementation of the method 
had been reported, the method was found to be biased, and the ACT stopped 
tests with the method after the first field trials. Reservations about item 
maps were not overcome in the field trial process, and item maps were 
eliminated as a choice. Concerns about computational procedures and the 
logistic demands of the Booklet Classification Method eliminated this 
approach. The Technical Advisory Committee on Standard Setting recommended a 
new combination method based on the method developed by M. Reckase in 
conjunction with the strong research base and extensive experience by ACT 
associated with the Mean Estimation method. Procedures based on this approach 
were used to set achievement levels for the 1998 NAEP in civics and writing. 
Appendixes contain examples of charts used in the rating method study. 
(Contains 18 tables, 33 figures, and 25 references.) (Author/SLD) 
Field Trials to Determine Which Rating Method(s) to Use
in the 1998 NAEP Achievement Levels-Setting Process for Civics and Writing¹



Susan Cooper Loomis, ACT, Inc. 
and 

Luz Bay, Advanced Systems 
Wen-Ling Yang, ACT, Inc. 
Patricia L. Hanick, ACT, Inc. 



Introduction 

ACT proposed several stages in preparation for the 1998 Achievement Levels-Setting (ALS) 
Process (ACT, 1997). These included simulation studies followed by field trials, followed by pilot 
studies. The major focus of these research studies was identification of a rating method for setting
the achievement levels. Because achievement levels-setting is a judgmental process, the best way 
to improve the outcomes of the process is to improve the method of collecting judgments. In 
designing the methodology for setting achievement levels for the 1998 Writing and Civics NAEP, 
ACT did not seek to find a “true standard”; rather, ACT tried to design a data collection procedure
based on a judgment task that panelists could easily comprehend so that systematic bias in
judgments would be minimized. 

In an effort to improve procedures used for the 1998 NAEP achievement levels-setting (ALS) 
process, ACT proposed to use an item-by-item rating method that was somewhat different from the
modified-Angoff rating method used in recent years. The change of method was proposed in
response to criticisms that the modified-Angoff method could not produce valid cutpoints because
panelists were incapable of performing the task of estimating probabilities with reasonable
accuracy (NAE, 1993; Shepard, 1995; Impara and Plake, 1996).

ACT proposed a rating procedure that required judges to estimate the most likely response of 
borderline student performance at each achievement level (ACT, 1997f). This method differed from
the modified-Angoff method in that panelists were asked to estimate the most likely response,
rather than the probabilities of correct responses, of student performance at the borderline of each
achievement level. Angoff (1971) described this procedure; Impara and Plake (1997) worked with
this rating method with dichotomous items; and Hambleton and Plake (1995) used a procedure
somewhat similar in a standard setting study involving polytomous items. These studies reported
success with using the method.

ACT conducted the simulation studies (Chen, 1998) and determined that the proposed method
(and computational procedures developed for it) was feasible in the context of NAEP. ACT called
the method the “Item Score String Estimation (ISSE) Method” because the ratings would produce
an “item score string” for students performing at the borderline of each achievement level. For
dichotomous items, panelists would judge whether students performing just at the borderline of
the achievement level were more likely to respond correctly or incorrectly. For polytomous items,
they would judge the most likely score (e.g., 1-4) for students performing just at the borderline of
each level.

1 This research was conducted under contract 5^07001001 with the National Assessment Governing Board. Susan Loomis wrote this
report, but the report draws heavily upon earlier reports by the author, Luz Bay and Patricia Hanick. Wen-Hung (Lee) Chen at ACT
developed the analysis programs for these field trials and helped with on-site analyses. Wen-Ling Yang performed the analyses and
produced feedback for the studies, as well as additional analyses for reporting on the studies. Teri Fisher at ACT coordinated the
acquisition and production of materials for each of the studies and assisted with materials in this and earlier reports. Jill Crouse at
ACT conducted the analysis of FT2 data and prepared summary reports to share with our Technical Advisory Committee on Standard
Setting.

Loomis/Montreal/NCME/April 1999

The computations required to determine the cutpoints were simplified using the proposed ISSE 
method. ACT's proposed new method combined two forms of assessment items; judges would
provide expected scores for performance items, and correct/incorrect scores for multiple-choice
items. Previously, cutpoints needed to be computed separately for dichotomous and polytomous
items. Concerns had been raised regarding how to combine ratings for items that were generated
from different rating methods. The ISSE method eliminated this concern.
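
As an illustration of how a single ISSE rating pass covers both item types, the following Python sketch stores one panelist's judgments for one borderline as an item score string. The item labels, maximum scores, and ratings are hypothetical, and the raw-points summary is an assumption made for illustration only; it is not the computational procedure ACT used to place cutpoints on the score scale.

```python
from collections import namedtuple

# Hypothetical record for one rated item:
#   max_score is 1 for a dichotomous item, larger for a polytomous item;
#   likely_score is the panelist's most likely score for a borderline student.
ItemRating = namedtuple("ItemRating", ["item_id", "max_score", "likely_score"])


def item_score_string(ratings):
    """Concatenate the judged scores into one string, one digit per item."""
    return "".join(str(r.likely_score) for r in ratings)


def percent_of_points(ratings):
    """Share of the total possible points implied by the score string."""
    earned = sum(r.likely_score for r in ratings)
    possible = sum(r.max_score for r in ratings)
    return 100.0 * earned / possible


# Invented ratings for the borderline of one achievement level.
borderline_basic = [
    ItemRating("MC1", 1, 1),  # multiple-choice: judged correct
    ItemRating("MC2", 1, 0),  # multiple-choice: judged incorrect
    ItemRating("MC3", 1, 1),
    ItemRating("CR1", 4, 2),  # constructed response: most likely score 2 of 4
    ItemRating("CR2", 3, 1),
]

print(item_score_string(borderline_basic))
print(percent_of_points(borderline_basic))
```

Because both item types are rated on the same "most likely score" task, one data structure holds the whole string and no separate rule is needed to merge two kinds of ratings.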

Overview of the Field Trials 

Two field trials were originally proposed in which research studies were conducted with panelists 
to determine which rating method to use in the ALS process. Each field trial had a unique purpose,
but they both addressed the common issue of examining rating methods. NAGB asked ACT to plan
separate field trials for writing and for civics and to examine more alternative methods for setting 
achievement levels in writing. As a result, ACT scheduled a pair of field trials for each of the two 
subjects. 

This was the first time that field trials, i.e., studies with panelists, had been conducted prior to the 
pilot studies. In 1994, ACT conducted the pilot studies for geography and U.S. History with major 
research components included. Four different rating methods were tried out during those two pilot
studies, and there were variations in feedback provided to panelists. The decision of which rating 
method to use needed to be made prior to the pilot studies, and the decision needed to be informed 
by research involving panelists. ACT and TACSS felt that it was important to conduct the 
research for identifying methods prior to the pilot studies so the pilot studies could be “dress 
rehearsals” for the ALS process. Thus, the field trials were included in the 1998 process. As 
happens so often, the field trials grew in scope and complexity as the details of the designs were 
being worked out. 

The initial purpose of the first field trial (FT1) was to compare the ISSE method to the combined
methods of modified-Angoff and mean estimation (ME), which had been used by ACT in ALS
processes for geography, U.S. History, and science. (See ACT, 1997f.) Results from FT1 were to
determine which item-by-item rating method to use for the remaining ALS studies, including the
second field trial (FT2). The selected method was to be used in civics and perhaps writing. By the
time the field trials were actually designed, however, four methods were being considered for FT1
for writing: the mean estimation method, the ISSE, the Booklet Classification Method, and a new
method named The Grid Method.

As proposed, the key issue for FT2 was to study the ratings produced from panelists using a
sequence of rating methods: one method followed by a second, different method. The first method
would be the item-by-item method selected as a result of FT1 (either ISSE or the combined
methods of modified-Angoff and ME), and the second method was to be an item-mapping method.
Field trial 2 was to address several issues: how the two methods interfaced with each other when 
used together in a rating sequence; how the panelists evaluated the two methods when used jointly 
and independently; and how the cutscores that were produced by the two methods differed. 

The effect of consequences data on panelists’ ratings was also investigated in both FT1 and FT2.
NAGB had never approved the introduction of consequences data during the rating process — 
before the final cutscores were computed. ACT began providing panelists with consequences data 
in 1994, based on the final cutpoints, and collecting panelists’ reactions to the data. Those data 
were reported to NAGB and considered during their deliberations regarding the cutpoints to set 
for each achievement level in geography, U.S. History, and science. Indeed, it was the reaction of
grade 8 science panelists to the consequences data that led to the decision to reconvene the panel
and provide them the opportunity to reconsider their cutpoints (ACT, 1997c). 

The field trials were designed to collect data to determine the extent of impact by consequences 
data on ratings. ACT needed to know when to introduce consequences data and how often to 
provide the data if they were to be used in the process. 

Field Trial 1: A Comparison of Two Item-by-Item Rating Methods 

The purpose of the field trials was to determine the rating method to be used to set achievement
levels for the 1998 Civics NAEP and the 1998 Writing NAEP. The Technical Advisory Committee
on Standard Setting (TACSS) recommended that a minimum of ten panelists be recruited for each
method group. Because ACT was not able to recruit that many panelists for the scheduled dates of
the field trials, the design of FT1 in each subject was modified. Only the ISSE method was
implemented for the FT1 in civics and only the ISSE and ME methods (as originally planned) were
implemented for FT1 in writing. Consequences data were introduced before the final cutpoints
were set for both subjects in FT1.

Data 

ACT used items from the 1994 Geography NAEP in the field trials for civics and 1992 NAEP
Writing data for the field trials for writing. ACT wanted to include feedback to panelists in the
field trials, so it was necessary to use NAEP data that were already available. The 1998
assessment data were still being collected at the time of the field trials.

The Geography NAEP was quite similar to the Civics NAEP in terms of the types of items 
(multiple-choice, short constructed response, and extended constructed response) and the relative 
frequency of each. Further, neither geography nor civics represents a “core” course in the
curriculum, and the two were judged to be similarly represented in the curricular offerings of
schools at the grade levels tested by NAEP. Of all of the subjects for which achievement level data
were available, geography seemed the most logical substitution for civics. ACT used the
achievement levels descriptions for the 1994 Geography NAEP in the field trials. Panelists were
asked to avoid reference to any reports on the Geography NAEP prior to participation in the civics 
field trials. 

ACT had worked with the 1992 Writing NAEP data, and that seemed the most obvious choice of 
data to use for the field trials. The problems experienced with the assessment data in 1992 were of 
concern, however. The framework document had been revised somewhat, and the test
specifications had been sharpened and tightened since the 1992 assessment. ACT used the 
achievement levels descriptions that had been developed for the 1998 Writing NAEP in the field 
trials for writing. The generic scoring rubrics for 1998 were specifically worded to avoid using 
exactly the same terms used in the achievement levels descriptions. The correspondence, or lack 
thereof, between the 1992 scoring rubrics — specific to each prompt — and the 1998 ALDs was not 
taken into account for the field trials.

Rather than use all of the items in the 1994 geography assessment, ACT used only the four blocks 
of items that had been used in validation studies for the geography achievement levels. Those
studies were conducted with data for grade eight only (ACT, 1995). The four item blocks were
selected to maximize the representation of the content framework and the characteristics of the
entire item pool for grade eight (Carlson, 1995). This choice to use only four representative blocks 
was made to decrease the amount of time required by the rating process. 
There were no similar concerns about time for rating writing prompts. The 1992 Writing NAEP for
grade 8 included only 11 prompts in all, and only 9 of those were 25-minute prompts. Since only 
25-minute prompts were to be used in reporting the 1998 NAEP, only the nine 25-minute prompts 
from the 1992 NAEP for grade 8 were used in the field trial. 

Panelists 

Twenty persons were to be empanelled in Iowa City for FT1 in civics and 40 persons were to be
empanelled for writing: 10 panelists for each rating method in the trials. The process for each
subject was planned to last two days. Panelists were recruited from the counties in eastern Iowa,
around ACT's national headquarters. The recruiting process was somewhat similar to that
planned for the actual ALS and pilot studies in that persons in specific positions (superintendents,
curriculum supervisors, mayors, and so forth) were asked to nominate teachers, nonteacher
educators, and general public representatives to serve on the panels. Panels were to be drawn
from the nominees to optimize the composition with respect to the targeted demographic attributes
for panels. Our highest priority is given to selecting panelists with the best qualifications. NAGB
specifies that the panels are to include three types of judges, and 55% of the panelists should be
teachers, 15% nonteacher educators, and 30% general public representatives. (Please see ACT, 1997g
for details.) Panelists were offered a small honorarium of $100 to participate in field trial #1.

The first field trial for civics was conducted February 7-8, 1998 and the first field trial for writing
was conducted February 28-March 1, 1998. ACT contacted hundreds of persons and asked for
nominations: school officials, elected officials, and companies likely to employ persons actively
engaged in working with knowledge and skills related to the subject areas. Despite intensive
efforts to recruit panelists, only 8 persons could be recruited for the first field trial for civics and 15
for writing. Teachers were simply too busy during this part of the school term to participate in
these studies. Further, the curriculum supervisors suggested that teachers are unwilling to spend
two weekend days working for such a small fee.²

The civics panel included 2 current teachers and one in her first year of retirement, 3 nonteacher
educators, and 2 general public members. There were two men and five women. The composition of
the writing FT1 panel was unique. For the first time ever, more general public members than
educators were included on a NAEP ALS panel. The writing FT1 panel included 4 teachers, 3
nonteacher educators, and 8 members of the general public. There were six men and 9 women in
the writing FT1.

Process 

Training. The NAEP ALS process typically lasts five days. All aspects of the ALS process, except
selection of exemplar items, were covered in the two-day field trials, at least to some extent. ACT
wanted to ascertain how panelists reacted to the rating methods and to other procedural changes
proposed for the 1998 NAEP ALS process, so some aspects of the process were sacrificed in the
context of collecting the field trial data. Relative to the typical ALS process, field trial time was
greatly reduced for training in the frameworks and achievement levels descriptions. Further,
there were only two rounds of item ratings.

Panelists were provided an abbreviated orientation to the achievement levels-setting process and
the process designed for the field trial. The orientation session included a general orientation to
the NAEP program and the process of developing NAEP achievement levels. Panelists
participated in several training exercises that were the same as those provided to ALS panelists.



2 ALS panelists are paid no honorarium, and the $100 had been judged “appropriate” for field trial panelists. Field trial #2 panelists
were paid $300 for the two-day study. 
Included in the training was administration of a form of the NAEP for each panelist to complete.
By taking the NAEP and scoring their work, participants become familiar with specific NAEP
items and scoring rubrics and with the general format of the assessment and the conditions under 
which it is administered. 

Since only 8 panelists were recruited for civics, only the ISSE method was implemented. ACT 
judged, and the Technical Advisory Committee on Standard Setting (TACSS) concurred, that the 
results of the ISSE method could be compared to the data collected in the geography ALS process 
using the Mean Estimation Method. ACT already had considerable experience with the ME 
method in assessments similar to civics. Both the Mean Estimation and the ISSE were 
implemented in the FT1 for writing, but the other alternatives being considered had to be
eliminated for the first writing field trial. 

Panelists spent the first day in training and preparation for rating items at the end of the first
day. Panelists reviewed assessment items, scoring rubrics, and student papers, and they were
engaged in exercises to become more familiar with the achievement levels descriptions before the
first rating session. The process implemented for each subject was quite similar, but some 
adjustments were needed to accommodate specific features of the assessments in the two different 
subjects. 

ACT typically uses a paper selection exercise to train panelists in the scoring rubrics for
polytomous items, to give them a “reality check” prior to the first round of ratings, and to give
them experience in applying their concept of borderline performance with respect to student
performance. The paper selection process was not implemented in the civics field trial, but it was
implemented in the writing field trial with papers written in response to three prompts, one of
each of the three types of writing assessed by NAEP. Three student papers were included for each
of 6 score points for a narrative prompt and for an informative prompt; the two highest score
points had been collapsed for the persuasive prompt and only papers for the 5 score points were
included. Each panelist thus had 51 student papers (36 + 15) from which to select one paper to represent
borderline performance for each achievement level.

Ratings and Feedback. There were two rounds of item-by-item ratings. Panelists were asked to 
form a concept of students performing at the borderline of each level. For the ISSE method, they 
were asked to judge whether such students were more likely to answer each multiple-choice item 
correctly or incorrectly. For constructed response items, they were asked to estimate the most 
likely score for such students. 

Panelists reported no special problems with the rating methodology, and the first round of item 
ratings went smoothly for civics. Writing panelists had more trouble with the rating methods, and 
this seemed especially true for some panelists in the ME group. Writing panelists had difficulty 
reconciling the scoring rubrics with the scores they had seen for some student papers in the paper 
selection process. They also had problems with the scoring rubrics relative to the achievement 
levels descriptions and the limited amount of time (25 minutes) allowed for student responses. 

The first round of ratings was collected in the afternoon of the first day. After rating all items in 
their rating pools, panelists left for the day. The rating forms were collected for computation of 
cutpoints and other feedback information overnight. All cutpoints reported to panelists and
presented here are reported on the ACT NAEP-Like scale.
Prior to the second round of rating on the second day of FT1, panelists were given feedback data
resulting from the ratings they provided in the first round. For FT1 writing, separate sets of
cutpoints and other feedback were computed for panelists in each of the two methods groups.

Panelists were provided with cutpoints, rater location charts, p-value tables, and whole booklet 
feedback. They were instructed in the source, meaning, and use of the feedback information. These 
feedback data were the same as used in previous NAEP ALS processes (ACT, 1997a, 1997b, 
1997c). Rater location data are provided as charts showing the location of the cutscore for each 
panelist at each achievement level. P-value tables report the percentage of students answering 
each dichotomous item correctly and both the average score for polytomous items and the
percentage of students scoring at each rubric point. The whole booklet feedback reports the 
expected percent correct score for the set of items in the NAEP exam booklet that panelists took 
earlier for practice. For example, the whole booklet feedback report might state: "Based on your 
group's average ratings, students performing at the borderline Basic level are expected to get 49% 
of the total possible score points for this booklet." (A similar statement is given for each 
achievement level.) This feedback was based on the cutpoints the group had set during the 
previous round of ratings. 
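
The arithmetic behind a whole booklet statement of this kind can be sketched in Python. The sketch assumes, purely for illustration, that each panelist supplies an expected score per item and that the group figure is the sum of item means divided by the total possible points. The feedback actually given to panelists was derived from the group's cutpoints, and that derivation is not described here; the function name and all numbers below are hypothetical.

```python
def whole_booklet_percent(ratings_by_panelist, max_scores):
    """Percent of total possible points implied by the group's average ratings.

    ratings_by_panelist: one list of expected item scores per panelist.
    max_scores: maximum possible score for each item in the booklet.
    """
    n_panelists = len(ratings_by_panelist)
    # Average the expected score for each item across panelists.
    item_means = [
        sum(panelist[i] for panelist in ratings_by_panelist) / n_panelists
        for i in range(len(max_scores))
    ]
    return 100.0 * sum(item_means) / sum(max_scores)


# Invented booklet: three multiple-choice items and one 4-point item.
max_scores = [1, 1, 1, 4]
borderline_basic_ratings = [
    [0.6, 0.4, 0.5, 2.0],  # panelist 1
    [0.8, 0.2, 0.5, 1.0],  # panelist 2
]

pct = whole_booklet_percent(borderline_basic_ratings, max_scores)
print(f"Borderline Basic: about {pct:.0f}% of the total possible score points")
```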

Panelists also participated in the whole booklet exercise, an extension of the whole booklet 
feedback. This exercise was added to the ALS process in 1994 in response to the NAE (1993) 
recommendation to include more “holistic” procedures in the process. To illustrate borderline 
Basic performance, they were shown copies of booklets with scores around 49% of the total possible 
points, for example. A few booklets scored within 2% of the cutpoint of each achievement level 
(above or below) were shared with panelists for their evaluation. They were asked to examine the 
responses of students and determine whether that performance represented their expectations for 
students at the lower borderline Basic level, for example. If they perceived a discrepancy between 
the performance expected and observed in the booklets scored at the cutpoint, then they were to 
discuss the achievement levels descriptions and borderline performances again with other 
panelists and try to understand the cause for this discrepancy. They were told that if they judged 
the performance to be too low relative to the description for achievement at the level, they should
increase their ratings for the level(s). If they judged the performance to be higher than they would
expect relative to the description for achievement at the level, then they should decrease their
ratings for the level(s).
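
Selecting booklets for this exercise amounts to a simple filter, sketched below in Python. The booklet identifiers and scores are invented, and reading "within 2%" as two percentage points on the percent-of-total-points scale is an assumption of this sketch.

```python
def booklets_near_cutpoint(booklet_percents, cut_percent, window=2.0):
    """Return IDs of booklets scored within `window` percentage points
    (above or below) of the percent-of-points value at the cutpoint."""
    return [
        booklet_id
        for booklet_id, pct in booklet_percents.items()
        if abs(pct - cut_percent) <= window
    ]


# Hypothetical scored booklets: ID -> percent of total possible points.
scored_booklets = {"B01": 46.5, "B02": 48.0, "B03": 49.0, "B04": 50.9, "B05": 53.0}

# Booklets to show panelists for a borderline Basic value of 49%.
print(booklets_near_cutpoint(scored_booklets, cut_percent=49.0))
```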

During the second round of ratings, panelists again rated all items in their item pool. They were 
told that they could change ratings for any items at any levels. Ratings were collected and 
feedback data produced for their review within about two hours. 

Consequences Data. The feedback information described above was updated after the second 
round of ratings. The percentages of students scoring at or above each achievement level based on 
the cutpoints that they set on the second round were provided as consequences data. ACT had 
proposed to introduce consequences data before the final round of ratings, and this field trial was 
one of several opportunities planned for collecting data to study the effect of providing 
consequences data. ACT had collected panelists’ reactions to consequences data provided at the end 
of the ALS process, and those data suggested that few panelists would make changes. Still, there 
was no evidence regarding what panelists would do when their actions could impact the cutscores. 
(Please see Appendix 1 for an example of the consequences feedback.) 
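
Consequences data of this kind reduce to counting the share of students at or above each cutpoint. The Python sketch below uses the Round 2 civics FT1 cutpoints from Table 1 with an invented, unweighted list of student scores; operational NAEP reporting relies on weighted score distributions, so this is a simplified illustration rather than the procedure ACT used.

```python
def percent_at_or_above(scores, cutpoints):
    """Percent of students scoring at or above each cutpoint.

    scores: student scale scores (unweighted in this sketch).
    cutpoints: mapping of achievement level -> cutpoint on the same scale.
    """
    n = len(scores)
    return {
        level: 100.0 * sum(score >= cut for score in scores) / n
        for level, cut in cutpoints.items()
    }


# Invented student scores on an ACT NAEP-Like scale.
student_scores = [120, 135, 150, 155, 160, 172, 180, 190, 200, 210]

# Round 2 civics FT1 cutpoints as reported in Table 1.
round2_cutpoints = {"Basic": 152.19, "Proficient": 171.47, "Advanced": 187.33}

print(percent_at_or_above(student_scores, round2_cutpoints))
```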

Panelists were asked to complete a questionnaire in which they were given the opportunity to
recommend new cutpoints that would raise or lower the percentages of students performing at or 
above each level. Those numbers were averaged and new cutpoints and consequences data were 
presented. Panelists were allowed to discuss the cutpoints and encouraged to reach common
agreement on a final set of cutpoints. 
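
The averaging step can be sketched as follows. This Python example assumes that a panelist who recommends no change for a level is counted at the current cutpoint; the text does not say how such responses were handled, and all values are invented.

```python
from statistics import mean


def revised_cutpoints(current, recommendations):
    """Average panelists' recommended cutpoints for each level.

    current: mapping of level -> cutpoint from the item ratings.
    recommendations: one mapping per panelist of level -> recommended
        cutpoint; a level left out is treated as "leave as set"
        (an assumption of this sketch).
    """
    return {
        level: mean(rec.get(level, cut) for rec in recommendations)
        for level, cut in current.items()
    }


current = {"Basic": 150.0, "Proficient": 170.0, "Advanced": 190.0}
questionnaires = [
    {},                                       # report cutpoints as set
    {},                                       # report cutpoints as set
    {"Basic": 146.0},                         # lower Basic only
    {"Proficient": 166.0, "Advanced": 186.0}, # lower two cutpoints
]

print(revised_cutpoints(current, questionnaires))
```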

The civics panelists had a rather lengthy discussion of the results and consequences, but there was
relatively little interest in changing the cutpoints. Four of the eight panelists recommended that
the cutpoints be reported as set. One panelist recommended a change in only the Basic cutpoint,
lowering it. Two panelists recommended changes in two cutpoints. One of those panelists would
lower the Proficient and Advanced cutpoints and one would lower the Basic and Proficient
cutpoints. One panelist recommended changes to all three cutpoints.

The writing panelists, on the other hand, seemed generally appalled by the consequences data.
Many spoke about their reluctance to “arbitrarily” change the cutscores, although most seemed to
find the outcomes unreasonable. When asked whether the results reflected their expectations,
only one panelist in the ISSE rating group said “yes.” There were many fewer changes
recommended by panelists in the ME group for the writing FT1 in response to the consequences
data. Five panelists in each group said they would change one or more cutpoints, and four of those
five in the ISSE group changed all 3 cutpoints. All changes to the Proficient and Advanced
cutpoints by ISSE panelists were to lower the cutpoints and increase the percentage of students
scoring at or above the levels. Only one of the four changes to the Basic cutpoint in the ISSE group
was to make it higher. In the ME group, only one panelist changed the Basic and Proficient
cutpoints, and five panelists lowered the Advanced cutpoint.

The recommended changes were used to compute new cutpoints and revised consequences data. 
Those were again shared with panelists, and they were again asked to evaluate the data.
Members of the civics FT1 panel had no further changes to recommend, but the writing panels did.

When asked whether the revised, “final” percentages reflected their expectations, all panelists in
the ME group now said “no,” and five in the ISSE group said “no.” In the ME group, only one
panelist would raise the Basic cutscore and the rest would leave it as set. Five ME panelists would
lower the Proficient cutpoint and three would leave it as set. Seven would lower the Advanced
cutpoint and one would leave it unchanged. They expressed a general lack of confidence in, and
concern about the “arbitrariness” of, the cutpoints computed on the basis of recommendations. There was a general
preference for the cutpoints based on their ratings.

Three ISSE panelists would lower the Basic cutpoint and four would leave it unchanged. Two
would raise the Proficient cutpoint, one would lower it, and four would not change it.

Results 

TACSS reviewed all data from the field trials in both civics and writing. Replicability of results is
one criterion TACSS suggested ACT use in evaluating the outcomes of the field trials for selecting
a method. In general, the number of panelists was too small for placing heavy emphasis on the
numerical outcomes. The intent of the studies had been to ascertain how well panelists react to
and interact with the methods and the feedback provided for each method. The evaluation data
collected from panelists, along with observations of staff engaged in implementing the process,
were of greatest interest to TACSS. ACT conducted extensive analyses of the data, and only
highlights are presented here.

Interjudge consistency evidence and intrajudge consistency evidence were available, to a
somewhat limited extent. Plake (1995) suggested that high interjudge consistency (low variability
among panelists’ ratings) could be used as an indicator of replicability. Table 1 reports the
cutscores and standard deviations on the ACT NAEP-Like scale for the civics FT1 ratings and
Table 2 reports the data for both rating methods for the writing FT1 ratings. There were no data
to which standard deviations from the civics FT1 could be compared. Ratings for the four blocks in
the geography grade 8 pool are reported below, but those data could not be used in computing the
standard deviations because of differences in rating groups and rating pools in the ALS process.
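
The interjudge consistency figures in the tables that follow pair a group cutpoint with the standard deviation of the panelists' individual cutpoints. A minimal Python sketch of that summary is given here; the eight panelist cutpoints are invented, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not say which form was computed.

```python
from statistics import mean, stdev


def cutpoint_summary(panelist_cutpoints):
    """Return the group cutpoint (mean) and the sample standard deviation
    of individual panelists' cutpoints for one achievement level."""
    return mean(panelist_cutpoints), stdev(panelist_cutpoints)


# Invented cutpoints for eight panelists at one achievement level.
basic_cutpoints = [145.0, 150.0, 155.0, 160.0, 140.0, 150.0, 152.0, 148.0]

group_cut, sd = cutpoint_summary(basic_cutpoints)
print(f"{group_cut:.2f} ({sd:.2f})")
```

A smaller standard deviation means panelists' individual cutpoints lie closer together, which is the sense in which low variability is read as evidence of replicability.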

Table 1

Cutpoints and Standard Deviations for ISSE Ratings of
4 Blocks of Items in the Grade 8 NAEP Geography Item Pool

                          Basic             Proficient        Advanced
                          Cutpoint (SD)     Cutpoint (SD)     Cutpoint (SD)

ISSE Method (Round 1)     149.62* (7.93)    171.47 (6.73)     189.75 (8.66)
ISSE Method (Round 2)     152.19  (9.79)    171.47 (4.67)     187.33 (4.00)

*Cutpoints are reported on the ACT NAEP-Like score scale.



The data in Table 2 show the standard deviations to be lower for the ISSE than for the
ME in writing.



Table 2

Cutpoints and Standard Deviations for ISSE and ME Ratings
of 25-Minute Prompts in the 1992 Grade 8 NAEP Writing Pool

                                      Basic             Proficient        Advanced
                                      Cutpoint (SD)     Cutpoint (SD)     Cutpoint (SD)

Writing FT1 ISSE Method (Round 1)     134.87* (7.02)    177.81 (10.18)    229.12 (6.92)
Writing FT1 ISSE Method (Round 2)     137.15  (1.82)    174.24 (5.9)      221.92 (10.52)
Writing FT1 ME Method (Round 1)       147.31  (14.78)   184.15 (12.17)    220.83 (12.05)
Writing FT1 ME Method (Round 2)       142.63  (12.95)   176.39 (10.92)    213.05 (9.47)

*Cutpoints are reported on the ACT NAEP-Like score scale.
Ratings by civics FT1 panelists using the ISSE resulted in higher cutpoints than those computed
from ratings by geography ALS panelists on the same items.³ For writing, the ISSE method
produced cutpoints that were more extreme: the Basic cutpoint was lower and the Advanced
higher than for the ME method. Since the NAEP ALS cutpoints have generally been criticized as
being too high, a method that set even higher cutpoints would not likely be selected, other things
being equal. The percentages based on the Round 2 item ratings by FT1 civics panelists and
Round 3 item ratings by the ALS panelists are reported in Table 3.



Table 3

Percentages of Students Scoring At or Above Cutpoints Set for 4 Blocks of Items
in the Grade 8 NAEP Geography Item Pool

                                      % At or Above   % At or Above   % At or Above
                                      Basic           Proficient      Advanced

Civics FT1 ISSE Method (Round 2)      61.6            11.6            0.3
Geography ALS ME Method (Round 3)     66.3            26.6            5.5

These results suggested that the percentages of students scoring at or above the levels would be 
lower using the ISSE method than the percentages at or above the cutpoints set in 1994 using the 
ME method. The 1994 grade 8 Geography cutscores based on all items in the grade pool resulted
in 71% of the students scoring at or above the Basic level, 28% at or above the Proficient level, and 
4% at or above the Advanced level. 

The results for the two methods used in the writing FT1 were quite similar, but the differences
showed the cutpoints using the ISSE method were slightly lower for Basic and considerably higher
for Advanced than those using the ME method. This is, of course, contrary to the indications from
the results of FT1 in civics. Both the ISSE and ME cutpoints for writing FT1 were higher than
those for grade 8 in 1992 using the paper selection method. The Round 2 FT1 percentages of
students scoring at or above the cutpoints are reported in Table 4 along with the data from the
1992 ALS process. Please note that both the process through which the 1992 ALS results were
reached and the computational procedures differed from those used in FT1 for writing.

Another measure evaluated by ACT and TACSS was changes in ratings. Reviewers of the NAEP
ALS process often comment that there is little or no change in cutscores from round to round. This
observation is often followed by questions regarding the necessity or utility of having 3 rounds of
ratings. ACT has found that panelists typically change relatively large numbers of item ratings
from Round 1 to Round 2, and that they change many fewer from Round 2 to Round 3. A summary of
changes by levels is provided for panelists’ ratings in each of the two subjects. Table 5 reports the
percentages of changes from Round 1 to Round 2 averaged over the 8 panelists for civics, and Table
6 reports the data for changes in ratings for writing, by method.

Data in Table 5 show that item ratings at the Basic and Proficient levels were more frequently
raised than lowered, but the proportion of ratings lowered at the Advanced level was just slightly
greater than that raised. In general, however, more ratings were unchanged from round to round
by the civics FT1 panelists.



3 These data are not entirely comparable because panelists did not participate in exactly the same process. Further, the items included
in these four blocks were not all rated by the same panelists in the ALS process. The data are presented as a point of comparison.
Table 4

Percentages of Students Scoring At or Above Cutpoints Set for Prompts
in the 1992 Grade 8 NAEP Writing Item Pool

                                                   % At or Above    % At or Above    % At or Above
                                                   Basic            Proficient       Advanced

Writing FT1 ISSE Method (Round 2)                  89.9%            8.7%             0.0%
Writing FT1 ME Method (Round 2)                    82.0%            6.2%             0.0%
Writing ALS Paper Selection Method (Round 3)       85.0%            15.4%            0.1%



Table 5 

Percentages of Civics FT1 ISSE Item Ratings Changed
from Round 1 to Round 2
for 4 Blocks of Items in the Grade 8 NAEP Geography Item Pool

Level         % Ratings Raised    % Ratings Same    % Ratings Lowered

Basic         15.0%               73.6%             11.4%
Proficient    13.0%               72.3%             9.2%
Advanced      4.3%                91.0%             4.7%



Table 6 

Percentages of Writing FT1 Item Ratings Changed (by Method) from Round 1 to Round 2
for 8 Prompts in the Grade 8 NAEP Writing Pool

                                  Mean Estimation (n=8)    ISSE (n=7)

% Basic Ratings Raised            7.9%                     19.4%
% Basic Ratings Same              50.8%                    68.1%
% Basic Ratings Lowered           41.3%                    12.5%
% Proficient Ratings Raised       0.0%                     9.7%
% Proficient Ratings Same         44.4%                    69.5%
% Proficient Ratings Lowered      55.6%                    20.8%
% Advanced Ratings Raised         3.2%                     1.4%
% Advanced Ratings Same           42.9%                    77.8%
% Advanced Ratings Lowered        54.0%                    20.8%



Relative to civics, a much larger proportion of items was changed in the writing field trial. The
majority of item ratings were unchanged for panelists using the ISSE method. Panelists who used
the ME method changed a larger proportion of items than panelists who used the ISSE method.
Panelists using both methods tended to lower their ratings more frequently than to raise them,
and this was especially true for Proficient and Advanced level ratings.
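
The raised/same/lowered percentages reported in Tables 5 and 6 can be illustrated with a short
sketch. This is not ACT's computational procedure. The function name and the ratings below are
hypothetical, and the sketch tallies one panelist at one achievement level, whereas the tables
report values averaged over panelists.

```python
def change_percentages(round1, round2):
    """Return (% raised, % same, % lowered) for paired item ratings.

    round1 and round2 hold one panelist's ratings for the same items,
    in the same order, at a single achievement level (hypothetical layout).
    """
    if not round1 or len(round1) != len(round2):
        raise ValueError("rating lists must be non-empty and equal in length")
    n = len(round1)
    raised = sum(1 for before, after in zip(round1, round2) if after > before)
    lowered = sum(1 for before, after in zip(round1, round2) if after < before)
    same = n - raised - lowered
    return tuple(round(100.0 * count / n, 1) for count in (raised, same, lowered))


# Hypothetical ratings for eight items (not data from the field trials).
r1 = [40, 55, 60, 35, 70, 50, 45, 65]
r2 = [45, 55, 55, 35, 70, 50, 40, 65]
print(change_percentages(r1, r2))
```

Averaging the three percentages across panelists in a rating group would give one row of a table
laid out like Table 6.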



Evaluations of Panelists. To better understand panelists’ perceptions of the rating process, they
were asked to respond to three questionnaires with Likert-type scale items: one questionnaire
after each round of ratings and another at the conclusion of the meeting.⁴ Evaluation data from
both the geography and U.S. history ALS panelists were analyzed, along with the civics FT1 data
for comparison. ACT was interested to learn how panelists reacted to various aspects of the field
trials, and panelists’ evaluation of the relative ease of rating items with the two methods was one
important aspect to be determined.

Responses by panelists in civics FT1 using the ISSE method suggest that their understanding of
the tasks, confidence in their ratings, and so forth were nearly as high as those reported by ALS
panelists in geography and U.S. history after two rounds of ratings using the ME method in 1994.
The responses to questions specifically about the rating methods were somewhat more positive for
the ISSE panelists than had been the case for the panelists using the ME method when
multiple-choice items were rated. The responses about using the ISSE method for rating
constructed-response items were somewhat less positive. The mean responses to those questions are
reported in Table 7.

The writing FT1 data are reported in Table 8, and they are, of course, only for rating
constructed-response items. Those responses are mixed. Conceptual clarity was rated higher by
panelists using the ME method, but ease of application was rated higher by panelists using the
ISSE method.

Table 7 

Mean Response Score to Selected Questions about Methods in Round 2 Ratings
by Civics FT1 Panelists and ALS Panelists in Geography and U.S. History
(5 = Totally Agree; 1 = Totally Disagree)

                                                  Civics FT1    Geography    U.S. History

The method for rating multiple-choice
items was conceptually clear.                     4.63          4.32         4.43

The method for rating multiple-choice
items was easy to apply.                          4.50          4.14         4.32

The method for rating constructed-response
items was conceptually clear.                     3.88          4.25         4.17

The method for rating constructed-response
items was easy to apply.                          3.88          3.89         4.07



4 Responses to questionnaire items by panelists in each field trial are available upon request. 
Table 8

Mean Response Score to Selected Questions about Methods in Round 2 Ratings
by Writing FT1 Panelists
(5 = Totally Agree; 1 = Totally Disagree)

                                                            ISSE (FT1)    ME (FT1)

The method for rating prompts was conceptually clear.       3.88          4.14
The method for rating prompts was easy to apply.            3.75          3.00


Questionnaire data indicate that FT1 writing panelists using the ME method increased their
confidence and understanding of the process and tasks more so than those using the ISSE method.
Indeed, the conceptual clarity of the rating method (reported in Table 8 above for Round 2)
increased from Round 1 to Round 2 for panelists using the ME method, and the ME method became
easier for them to apply. The responses of ISSE panelists indicated no improvement by Round 2 in
the conceptual clarity of the rating method, and their evaluation of the ease of applying the
method indicated that it was less easy to apply in Round 2 than in Round 1.

Consequences Data. Panelists were given consequences data based on their Round 2 ratings. An
example of the format used for reporting consequences data is provided in Appendix 1. The
percentages reported above in Table 3 for civics and Table 4 for writing are the consequences data
shared with panelists in the field trials. Panelists were asked to evaluate the data and then they
were asked to recommend new cutpoints if they felt the consequences data were not reasonable.
The general changes recommended were reported in the Process section above.
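
As a rough illustration of what consequences data are, the sketch below computes the percentage
of scores at or above each cutpoint. It is a simplification under stated assumptions: the score
list is hypothetical and unweighted, and the function name is illustrative. The operational NAEP
computation, which involves plausible values, is not reproduced here. The cutpoints are the
Round 2 ME values for writing FT1.

```python
def percent_at_or_above(scores, cutpoints):
    """Percentage of scores at or above each cutpoint, rounded to one decimal.

    scores: iterable of scale scores (hypothetical, unweighted).
    cutpoints: mapping of achievement level name -> cutpoint on the same scale.
    """
    scores = list(scores)
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    return {
        level: round(100.0 * sum(1 for s in scores if s >= cut) / n, 1)
        for level, cut in cutpoints.items()
    }


# Hypothetical score distribution; not NAEP data.
scores = [120, 140, 150, 160, 175, 180, 190, 200, 215, 230]
cutpoints = {"Basic": 142.63, "Proficient": 176.39, "Advanced": 213.05}
consequences = percent_at_or_above(scores, cutpoints)
print(consequences)
```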

The new cutpoints recommended by each panelist were averaged. For panelists who chose to
recommend unchanged cutpoints, the grade level cutpoints from Round 2 were the values used to
compute the new average. For civics, the averages are 150.78 for Basic, 170.24 for Proficient, and
186.48 for Advanced. Their recommendations were generally to lower the cutscores for each level.
During the discussion they decided that those averages were to be their final recommendation.
Only one person suggested a change from that average. Based on those new cutpoints, 64.3% of
grade 8 students would score at or above Basic, 13.6% at or above Proficient, and 0.4% at or above
Advanced.
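
The averaging rule just described can be sketched in a few lines. The function name, the use of
None to mark a panelist who recommended no change, and the numbers are all assumptions made for
illustration.

```python
def average_recommended_cutpoint(recommendations, round2_cutpoint):
    """Average panelists' recommended cutpoints for one achievement level.

    recommendations: one entry per panelist; None means the panelist
        recommended no change (a hypothetical encoding).
    round2_cutpoint: the grade level cutpoint from Round 2, substituted
        for every panelist who recommended no change.
    """
    if not recommendations:
        raise ValueError("at least one panelist is required")
    values = [round2_cutpoint if r is None else r for r in recommendations]
    return sum(values) / len(values)


# Hypothetical recommendations from four panelists; two kept the Round 2 value.
new_cutpoint = average_recommended_cutpoint([150.0, None, 148.0, None], 152.0)
print(new_cutpoint)
```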



For writing, there were many more changes recommended, as noted earlier. The same procedure
was used for computing the new cutscores, based on the recommendations for change. The two
groups were each engaged in a separate discussion of consequences data. Panelists using the ME
method lowered the cutpoints for the Basic level somewhat, and they lowered the cutpoints
considerably for both Proficient and Advanced. Their recommendations, based on the
consequences data, resulted in 83.3% of the students scoring at or above the Basic level, 14.9% at
or above the Proficient level, and .009% at or above the Advanced level. The Advanced cutscore
was lowered from 213 to 206, but this was not nearly low enough to include even 1% of the grade 8
scores for students in the 1992 Writing NAEP.



For FT1 writing panelists using the ISSE method, only one panelist recommended changes in the
Basic and Proficient cutpoints, and those changes had little impact on the final results. Five
panelists recommended that the Advanced cutpoint be lowered, although two of those
recommendations were for only very minor changes. The final Advanced cutpoint for the ISSE
group was 217.3, and the percentage of student scores at or above this level was less
than 0.00%.






Loomis/Montreal/NCME/April 1999






Panelists were generally favorable to having consequences data and to having the opportunity to
recommend changes. They felt, however, that recommending changes at the end of the rating
process introduced arbitrariness to the process that caused them to feel less inclined to make
changes and less positive about the opportunity. This reaction indicated that they would prefer
having the consequences data earlier in the process, during the rounds of ratings.

Conclusions for Field Trial 1 

Results from FT1 were to have determined which rating method to use in the ALS studies. ACT
proposed that an item mapping method also be used in conjunction with whatever rating method
was selected as a result of FT1. FT2 was then to be conducted to examine the interface of the two
methods when used jointly, to evaluate how panelists perceived the two methods when used
together and separately, and to determine the impact on cutscores of using two rating methods in
one process.

After carefully reviewing the data reported by ACT from the first field trial for each subject,
TACSS was unable to recommend one of the two methods as the unambiguous choice to carry
forward to the second field trial and the remainder of the ALS process. Both NAGB and TACSS still
wanted to have information from other methods in writing. TACSS recommended that ACT
design the second field trial for civics with the two item-by-item rating methods used in field trial
1, and include the other research factors originally planned with one method. For field trial 2 in
writing, they were eager to see results from several alternatives.

Detection of Bias in ISSE 

Results of the field trials were viewed somewhat skeptically by TACSS. There was concern that
the ISSE method would result in more extreme cutscores due to a bias. That is, some TACSS
members felt that the method would necessarily result in lower Basic cutscores and higher
Advanced cutscores, compared to the “true score” (Reckase, 1998; Bay, 1998). As a result of work
by both Reckase and Forsyth, and based on the findings of the first field trials, TACSS
recommended in May 1998 that ACT discontinue further research using the ISSE method. They
recommended that ACT explore alternatives already under consideration.

Field Trial 2: A Comparison of Methods and Timing of Consequences Feedback 
ACT’s plan had been to use the method selected on the basis of field trial 1 as the first of two
methods used in setting achievement levels in field trial 2. The second method was to be an item
mapping method that would allow panelists to make adjustments to their cutpoints directly. That
is, rather than continuing with adjustments to item ratings after two or three rounds, panelists
would switch to maps or charts with the items arrayed according to some statistical criterion, e.g.,
response probability = 65%. The second field trial had been planned as an opportunity to study the
interface of the two methods for setting achievement levels: an item-by-item rating method and an
item mapping method.

Since no method had been selected as a result of the first field trials, and since the ISSE method 
had been eliminated from further consideration, the design of the second field trials had to be 
changed. The goal was still to test several alternatives for writing. Meanwhile, Reckase had
proposed a new method, and TACSS agreed that it should be tested in the second field trials. 

In addition to studying alternative standard setting methods, including interfacing methods, ACT
wanted to collect data on the issue of providing consequences data. Specifically, ACT wanted data
to help decide when in the process to provide the data and how often to provide it. When panelists
are given consequences data at the end of the process, few recommend changes in response to the
data. NAGB needed to know whether this general trend would hold if panelists were given
consequences data during the process when the cutscores could actually be altered. The data from
FT1 would be supplemented through FT2 research.

ACT was to develop several design alternatives for consideration by TACSS. The designs from 
which TACSS selected are included in Appendix 2 along with the design adopted for writing FT2. 
Of those designs considered for civics FT2, Design 3 is the one adopted. 

TACSS recommended that ACT test the Mean Estimation Method with Item Mapping and the
Mark Reckase Method in the civics FT2. In order to include research on consequences data, four
groups were needed for the study, and each group was to include 10 panelists.

The decision on methods to test in the writing FT2 was more difficult. TACSS was somewhat
divided with respect to the choice. In order to collect data on the impact of consequences data in
writing FT2, no more than two different rating methods seemed feasible. Ultimately, TACSS
decided that the Booklet Classification method and the Mark Reckase method should be tried in
this study. The Grid method was rejected because the computational procedures had not yet been
determined and it was a totally new procedure (Bay and Loomis, 1998). The two methods selected
included both an item-by-item and a holistic approach. Before describing the process, more
information about the methods is needed.

Alternative Methods for FT2 
Item Mapping Method 

The Item Mapping (IM) method investigated in the civics field trial is very similar to the
Bookmark method used by CTB/McGraw-Hill (Lewis, Mitzel, and Green, 1996). ACT implemented
IM procedures in the 1996 Science NAEP ALS process (ACT, 1997c), for research purposes and for
grade 8 panelists when they were reconvened and given the opportunity to change their ALS
recommendations. The IM method uses a linear chart that indicates the approximate range of test
scores earned on the NAEP. Each item was located or mapped at a point on the ACT NAEP-like
scale where student performance reaches a 65% probability of correct response for the item. By
studying the item map, one is able to determine which test items were responded to correctly 65%
of the time by students who scored within a certain range (i.e., achievement level) on the score
scale. Items were identified by a sequential number representing their rank with respect to
difficulty. Abbreviated item descriptions, along with the score point at which each item “mapped,”
were included in the materials provided to panelists.

Dichotomous items are mapped directly. Polytomous items are dichotomized at each score level
and then mapped to the scale in the same manner as dichotomous items. Thus, each polytomous
item is mapped to the scale one time fewer than the number of score levels. FT2 used a similar
mapping procedure. The exact mapping criteria were never fully “approved” by TACSS because
its members were never fully in agreement with the use of item maps. Ultimately, the decision was
to use the criterion used most frequently in NAEP reporting, i.e., 65%.
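
To make the 65% mapping criterion concrete, the sketch below locates a dichotomous item at the
point where the probability of a correct response equals 0.65. Everything specific in it is an
assumption rather than something taken from this study: it uses a three-parameter logistic IRT
model with scaling constant 1.7, works on the latent theta scale rather than the ACT NAEP-like
scale, and the item parameters and names are hypothetical.

```python
import math


def p_correct(theta, a, b, c=0.0, D=1.7):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def rp_location(a, b, c=0.0, rp=0.65, D=1.7):
    """Theta at which the assumed 3PL probability of success equals rp."""
    if a <= 0 or not (c < rp < 1.0):
        raise ValueError("need a > 0 and c < rp < 1")
    q = (rp - c) / (1.0 - c)  # success probability with guessing removed
    return b + math.log(q / (1.0 - q)) / (D * a)


# Hypothetical item parameters (a, b, c); not items from the NAEP pool.
items = {"A": (1.0, -0.5, 0.2), "B": (1.0, 0.8, 0.2), "C": (1.0, 0.1, 0.2)}
locations = {name: rp_location(*params) for name, params in items.items()}

# Sequential rank with respect to difficulty, lowest mapped location first.
ranked = sorted(locations, key=locations.get)
print(ranked)
```

A polytomous item would be handled by applying the same idea to each dichotomized score level,
which is omitted here.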

The IM method was studied in conjunction with the ME method. Panelists in the ME group rated
items on an item-by-item basis for two rounds, and then they were given item maps and lists with
item information to identify the items. They examined the maps and their other feedback data to
decide whether the group cutscore should be modified. They recorded the recommended cutpoint
for each level on their item map. The recommended cutpoints were averaged for computing the
final cutpoint for each rating group.

The Mark Reckase Method

ACT had experimented over the years with various methods of providing information about 
intrarater consistency. TACSS had generally judged the efforts as less than successful, and 
intrarater consistency feedback was dropped from the feedback provided to panelists in the 1996 
Science ALS process. The Reckase method addressed the need to provide panelists with useful, 
easily understood information. The information in the Reckase Charts would help them evaluate 
the consistency of their ratings for items of different formats, different content dimensions, and 
other item characteristics. 

Reckase proposed to have panelists use an item-by-item rating method to generate an initial set of
ratings. The modified Angoff/ME method was suggested. Charts, now known as “Reckase Charts,”
were presented to panelists after the first round of item ratings. These Reckase Charts include
expected student response scores for each item at each point on the score scale. A column on the
chart contains expected score data for one item across all score points. A row contains expected
score data for one point on the score scale across all items. The expected score data are generated
by the IRT model. For polytomous items, the expected score data are reported as a mean score for
each item. For dichotomous items, the expected score data are reported as the probability of
correct response/percentage of students responding correctly.
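
The row-by-column layout of such a chart can be sketched as follows. The IRT models are
assumptions chosen for illustration (a three-parameter logistic model for dichotomous items and a
generalized partial credit model for polytomous items), the score points are on a latent theta
scale rather than the ACT NAEP-like scale, and the item parameters and function names are
hypothetical.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)


def p_correct(theta, a, b, c=0.0):
    """Probability of a correct response under an assumed 3PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def expected_score(theta, a, steps):
    """Expected (mean) item score under an assumed generalized partial credit model."""
    logits = [0.0]  # score 0, then a running sum over the step parameters
    for step in steps:
        logits.append(logits[-1] + D * a * (theta - step))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    return sum(k * w for k, w in enumerate(weights)) / sum(weights)


def reckase_chart(score_points, items):
    """One row per score point, one column per item, as the text describes."""
    chart = []
    for theta in score_points:
        row = []
        for item in items:
            if item["kind"] == "dichotomous":
                row.append(p_correct(theta, item["a"], item["b"], item["c"]))
            else:
                row.append(expected_score(theta, item["a"], item["steps"]))
        chart.append(row)
    return chart


# Hypothetical items: one multiple-choice item and one three-level prompt.
items = [
    {"kind": "dichotomous", "a": 1.0, "b": 0.0, "c": 0.0},
    {"kind": "polytomous", "a": 1.0, "steps": [-1.0, 1.0]},
]
score_points = [-2.0, -1.0, 0.0, 1.0, 2.0]
chart = reckase_chart(score_points, items)
for theta, row in zip(score_points, chart):
    print(theta, [round(value, 2) for value in row])
```

Reading down a column shows how the expected score for one item rises across score points, and
reading across a row shows expected performance on all items at one score point.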

For writing, all prompts in the rating pool could be printed on one large chart. For civics, each
block required a separate page or chart. For FT2, separate color-coded charts were prepared for
each of the three achievement levels: Basic ratings were marked on blue charts, Proficient ratings
on pink, and Advanced on amethyst.

The Reckase Method required that panelists transfer their item ratings from Round 1, for 
example, to the charts. By marking ratings on the charts, panelists would be able to visually 
inspect their ratings for each item with respect to their own individual cutscore, the grade level 
cutscore, item type (multiple choice and constructed response), and content/prompt type 
(persuasive, informative, and narrative, for example). 

Panelists inspected the charts, along with other feedback data, and decided on ratings for each 
item in a second round of item-by-item ratings. The third round of ratings required panelists to 
select a row, i.e., a score, to represent their cutpoint for each achievement level. 

The amount of information available to panelists through the use of Reckase Charts was great.
There was some concern, however, that ratings would be “data driven,” and that panelists would
lose their focus on the achievement levels descriptions (the standards) to be used in making their
judgments. As a result of this concern, TACSS recommended that the charts presented after the
first round of ratings exclude the ACT NAEP-Like scores. Each row on the charts was identified
with alpha coding. Panelists would only have their item ratings on the charts to evaluate before
Round 2 ratings. They were instructed to examine items for which ratings appeared particularly
high or low to determine whether any patterns emerged based on item type or content area, and to
pay particular attention to ratings for which their confidence was especially high or low when
rating the item. An example of a chart is included in Appendix 2.

After Round 2, new charts were again distributed to panelists, and they again transferred their 
ratings to the charts. This time, the ACT NAEP-Like score points were included on the charts. 
Panelists could evaluate their ratings relative to the grade level cutscores and relative to their own 
cutscores although they were not explicitly instructed to do so. 
The Reckase Method was to be tested in the second field trial for each subject. Because the final
round of ratings was not an item-by-item rating procedure, item maps would not be tested in
conjunction with the Reckase Method. Instead, Round 3 ratings in the Reckase Method required
panelists to draw a line to identify the cutscore for each level. A space on the rating form was
provided to record the cutscore for each level.

Booklet Classification 

ACT had used a booklet classification method in validation studies for geography, U.S. History,
and science (ACT, 1995; 1997d). Results of those studies suggested that the cutpoints resulting
from a Booklet Classification (BC) method would be higher than with the mean estimation
method. ACT (Hanson, Bay, and Loomis, 1998) conducted further research on this method.
Evidence suggested consistently that the cutpoints would be set higher with a BC method than
with the ME method. Since “reasonable” outcomes are a goal of the ALS process, the method
seemed to hold little promise. In addition to the reasonableness of outcomes, however, there was
the issue of how to compute cutscores with a BC method in NAEP. Alternatives were presented to
TACSS for their review (Hanson, 1998).

There were many important issues to be resolved regarding the BC method (Bay, 1998). These
included the number of booklets to be classified by panelists, the number of categories for the
classification, the distribution of scores for booklets selected for the study, the criteria for
determining the “score” in the NAEP context of plausible values, and the number of different
booklet forms to use to represent the assessment pool without overburdening the panelists. Bay
designed the study (Bay, 1998), and a more detailed description of the design implemented in the
writing FT2 study is included in Hanson and Bay (1999).

Each panelist classified 40 booklets. There were 20 booklets in each of two different forms. Each 
panelist had forms including at least one prompt for each type of writing. In order to provide 
panelists the opportunity to discuss booklet classifications after Round 1, the design provided 10 
booklets in each form to be classified by two people who would be seated together. Thus, each 
panelist had 10 booklets of form A to discuss with panelist A (on the right) and 10 booklets of form
B to discuss with panelist B (on the left).

TACSS recommended that booklets be ordered on performance from lowest to highest. Panelists
were told that the ordering was to facilitate their task and that it represented only one of many 
such orderings that might be used. They were told that their classifications did not have to reflect 
the ordering because classifications were to be made on the basis of the achievement levels 
descriptions. 

Mean Estimation 

The mean estimation method has been used by ACT for setting achievement levels for NAEP
since 1994. The method uses a modified-Angoff rating judgment for dichotomous items and
estimation of the mean score for polytomous items (ACT, 1997c). The method was used in FT1 for
both civics and writing.

The Panels 

The plan for recruiting panelists for field trial 2 was the same as that planned for FT1. Given the
lack of success with recruiting panelists in FT1 for each subject, however, the plan clearly had to
change. TACSS advised that it was imperative that FT2 in each subject include at least 10
panelists for each method/procedure tested. In order to meet the requirements for the number of
panelists, ACT scheduled the second field trials in the summer, after the school term had ended.
The meetings were also scheduled on weekdays because of suggestions from nominators and
potential panelists that this would increase participation. Further, at the recommendation of
TACSS, NAGB authorized ACT to offer an honorarium of $300 for the two-day field trials. ACT
had no problem with recruiting the required number of panelists when these changes were
announced to nominators. We accepted panelists who volunteered; our usual selection process was
not implemented. We did attempt to recruit panels representing educators and non-educators.
People in specific positions nominated candidates to serve on the field trial panel, and candidates
were screened to assure that they had content knowledge and familiarity with students at grade 8.

For FT2 in civics, there were 43 panelists: 32 teachers, 5 nonteacher educators, and 6 general
public members of the panels. There were 27 men and 16 women in FT2 for civics. For writing,
there were 40 panelists: 30 teachers, 3 nonteacher educators, and 7 general public panelists.
Thirty-three panelists in FT2 for writing were women and 7 were men. Panelists were assigned to
the rating groups so that each was as equivalent as possible.

One error occurred in the civics FT2 assignment of panelists to groups. The initial assignments
were made so that each rating method group was as equivalent as possible and, within each
method group, each consequences data treatment group was also as equivalent as possible.
During training exercises, however, staff noted that one particular group was taking far longer to
complete tasks than others. The decision was made to reassign some panelists from one table to
another, i.e., one consequences data treatment group to another, within the ME rating group. The
reassignments were made on the basis of the panelist identification number. This reassignment
made it much easier to distribute materials, but it resulted in having only teachers at one table
rather than a mix of panelist types.

The Process 
The Design 

FT2 in each subject included two methods. For civics FT2, the Mark Reckase method and the ME
method with item mapping were implemented. At least 10 panelists were assigned to each group.
The four groups in each subject were as follows.

Civics FT2 

a. mean estimation with item maps and consequences data after each round 

b. mean estimation with item maps and consequences data after round 3 

c. Reckase method with consequences data after each round 

d. Reckase method with consequences data after round 3 

Writing FT2 

a. booklet classification with consequences data after each round 

b. booklet classification with consequences data after round 2 

c. Reckase method with consequences data after each round 

d. Reckase method with consequences data after round 3 

Implementation of the Process 

The planned field trial process was implemented in each subject with relatively few problems. The
same orientation and training were provided for the FT2 panels as were described previously for
the FT1 panel. These panelists also took a form of the NAEP, just as all NAEP ALS panelists do.
Participants were divided into equivalent groups, as described for FT1, and they were also
assigned to table groups to be as equivalent as possible. The civics groups rated the same set of
items described for FT1, and both methods groups used the same rating method during the first
round. Civics FT2 panelists were trained together through the first round of ratings.

Writing panelists using the Mark Reckase (MR) method rated the same items described for FT1.
Panelists using the Booklet Classification (BC) method had fewer different prompts in their pool
for classification, but they were from the 1992 Writing NAEP and one prompt of each type was
included in the forms for classification.



Panelists were again engaged in training exercises to become familiar with the assessment and the
achievement levels descriptions. The first round of ratings/classifications was scheduled at the
end of the first day. The facilitator of each group provided training in the rating methods for FT2
writing.

Cutpoints and other feedback data were computed and produced to distribute to panelists at the
start of the second day. The feedback described for FT1 was provided to all panelists in the second
field trials. One exception was that BC panelists did not participate in the whole booklet exercise.
In civics, all panelists were trained together in the feedback common to the two methods.
Following the general session of training, panelists were provided information in each rating group
regarding feedback specific to their method. Groups C and D were instructed in Reckase Charts.
Four sets of data were prepared for distribution to FT2 panelists. Although all ratings in FT2
civics were based on the same method for Round 1, separate data reports were prepared. Panelists
in each group would have separate reports in subsequent rounds, and it seemed a good idea to give
them separate reports beginning with Round 1 results and feedback. Because the rating methods
for the writing FT2 were so different from the start, feedback was computed for each group
separately and training in the feedback was conducted separately for each method group.

(Please refer to the design of FT2 for writing in Appendix 2.)

Panelists in groups A and C received consequences data after Round 1 and after each subsequent
round. Panelists in these groups were trained in consequences data and provided the information.
A form was distributed to each panelist in the two groups and they were asked to comment on the
consequences data. For FT2 in civics, the facilitator forgot to distribute the questionnaire until
after the panelists had merged back with their rating groups. The panelists were interrupted
briefly and asked to complete the questionnaire. No major problem was apparent as a result of
this error.



Results 

General Overview 

Panelists were generally receptive to each method. Panelists found the Reckase Charts very
informative and interesting. Similarly, FT2 civics panelists were enthusiastic about the
student performance information represented in the item maps. Each method,
or combination of methods, was of interest to ACT. This was our first experience with any of the
three methods, as such, being tested in the field trials. Panelists seemed to have no problems with
the item maps (civics only) and the Reckase Charts (both civics and writing). Having the booklets
ordered in writing FT2 seemed to sharply change the task from the validation studies using
booklet classification implemented previously by ACT. Rather than placing booklets in categories
of achievement, panelists simply wrote their classifications on the booklets and on their “rating”
form. Panelists did not classify booklets according to the ordering. That is, they did classify some
booklets with higher ranks at lower levels than others around the same rank, and they classified
booklets from lower ranks at higher levels than others around the same rank. Forty booklets,
ordered on performance, and involving only two forms, did not present a challenging task to the
panelists. They really appreciated the opportunity of discussing booklet classifications, and
panelists in other groups seemed to really enjoy the opportunity of discussing item maps and
Reckase Charts. ACT was further convinced of the importance of providing time for panelists to
discuss tasks among themselves.






Loomis/Montreal/NCME/April 1999 




Findings for the writing field trials require some caution with respect to comparisons of cutscores. 
As was true for FT1, writing FT2 panelists worked with the achievement levels descriptions 
developed for the 1998 ALS process. In general, the cutscores set by FT2 writing panelists in both 
method groups were lower than those from FT1. The cutpoints set at the Proficient level, and the 
consequences data associated with those scores, were particularly uncommon relative to ALS 
results for other subjects. The percentages of students scoring at or above the Proficient level set by 
panelists in writing field trials were generally quite low. As discussed below, panelists in the MR 
group in writing FT2 lowered the Proficient cutscore after seeing consequences data and reversed 
this general finding. 

Findings 

Round 1 cutscores and feedback ratings for panelists in Group A of FT2 civics showed that several 
raters gave much lower borderline Basic ratings for items than others in the group. The facilitator 
discussed their ratings with these panelists, and they indicated that they had “gotten off track” in 
their ratings. The group as a whole was advised on how to interpret the feedback data in light of 
these errors. The decision was made to use the group mean to replace the cutscores of these 
panelists in analyses of results. Data reported in Appendix 2 show results both with and without 
outliers. 
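
The replacement of outlier cutscores with the group mean can be sketched as follows. This is a minimal illustration only: the panelist identifiers and score values are hypothetical, and the assumption that the "group mean" is computed from the remaining (non-outlier) panelists is ours, since the text does not say how the mean was formed.

```python
from statistics import mean


def replace_outliers_with_group_mean(cutscores, outlier_ids):
    """Return a copy of `cutscores` in which each flagged panelist's value
    is replaced by the mean of the non-flagged panelists.

    Assumption (not stated in the report): the group mean excludes the
    flagged panelists themselves.
    """
    kept = [score for pid, score in cutscores.items() if pid not in outlier_ids]
    if not kept:
        raise ValueError("at least one non-outlier panelist is required")
    group_mean = mean(kept)
    return {
        pid: (group_mean if pid in outlier_ids else score)
        for pid, score in cutscores.items()
    }


# Hypothetical Basic cutscores for four panelists; "P4" is flagged as an outlier.
basic = {"P1": 150.0, "P2": 148.0, "P3": 152.0, "P4": 95.0}
adjusted = replace_outliers_with_group_mean(basic, {"P4"})
print(adjusted)
```

Reporting results both with and without the replacement, as the appendix does, keeps the effect of this adjustment visible.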

Results from FT2 for the two subjects were inconclusive regarding the effects of consequences data 
on the cutpoints. In civics FT2, the group A panelists using the ME method and receiving 
consequences data throughout the process set cutscores lower across all rounds, taken together, 
than those in group B using the ME method and receiving consequences data later in the process. 
Results for the group using the MR method were just the opposite. That is, the group C panelists 
who received consequences data first after Round 1 set their cutscores higher across the rounds, 
taken together, than the group D panelists who received consequences data later in the process. 
(Please refer to tables and charts in Appendix 2.) Overall differences between cutpoints for groups A and B 
using the ME method in civics FT2 were significant. Overall differences in cutpoints for groups C 
and D using the MR method were not significant. The timing of consequences data did 
not appear to have an effect on the cutpoints for the MR method. Timing of consequences data did 
appear to have an effect on the cutpoints for the ME method. Panelists who received consequences 
data earlier in the process set lower cutscores. 

The data in Table 9 (below) report results for civics FT2 by rounds of rating and 
method/consequences groups. Recall that all panelists used exactly the same rating method for 
Round 1 ratings. Panelists in group A recommended no changes in their cutscores after Round 3; 
thus the final cutpoints for each panelist were all the same. 

For writing FT2, the two methods implemented were quite different. The BC method groups 
classified booklets two times and then discussed consequences data, whereas the MR method groups 
had two rounds of item-by-item ratings before deciding their cutpoint for each level on the Reckase 
Charts. 

In general, panelists in the BC method group set cutpoints across the three rounds that did not 
differ significantly by the timing of consequences feedback data. Data reported in Table 10 below 
show that BC panelists who received consequences data set slightly higher cutscores than those who 
did not. On the other hand, panelists in writing FT2 using the MR method and receiving 
consequences feedback data throughout the process generally set higher cutscores. At the Advanced 
level, cutscores set by MR panelists were higher than those set by BC panelists, no matter when 
consequences data were introduced. 
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Table 9 

Cutpoints, Standard Deviations and Percentages of Students Scoring 
At or Above Each Achievement Level for FT2 Civics: By Group and Round of Rating 

              A (n=10) ME/IM          B (n=11) ME/IM          C (n=11) ME/MR          D (n=11) ME/MR
Level         Cutpoint (SD)   %>=     Cutpoint (SD)   %>=     Cutpoint (SD)   %>=     Cutpoint (SD)   %>=

Round 1
  Basic       136.4 (25.5)    89.8    146.5 (12.2)    74.3    145.9 (9.3)     75.1    138.6 (11.3)    87.4
  Proficient  157.6 (5.1)     46.3    164.5 (5.1)     26.6    163.2 (4.4)     29.7    162.5 (4.5)     32.0
  Advanced    168.9 (5.8)     16.0    175.3 (4.1)     6.0     173.9 (3.6)     7.9     174.1 (5.0)     7.4

Round 2
  Basic       146.7 (7.8)     73.5    151.3 (5.1)     63.5    148.4 (7.4)     70.0    143.5 (8.6)     79.5
  Proficient  161.0 (4.8)     35.8    166.6 (3.1)     21.6    165.6 (2.7)     23.5    162.5 (4.7)     32.0
  Advanced    171.8 (5.9)     10.9    177.3 (2.7)     4.0     177.0 (3.1)     4.4     175.5 (4.2)     6.0

Round 3
  Basic       149.7 (4.1)     67.1    153.2 (3.4)     58.6    149.2 (6.0)     68.2    146.5 (8.6)     73.5
  Proficient  163.2 (2.9)     29.2    167.1 (2.8)     19.6    164.3 (3.3)     26.6    163.2 (4.1)     29.7
  Advanced    174.5 (4.3)     7.0     178.0 (2.3)     3.4     176.7 (4.8)     4.4     177.4 (5.1)     4.0

Final
  Basic       149.7 (0.0)     67.1    153.1 (2.0)     58.6    145.9 (3.8)     75.1    147.6 (4.0)     71.8
  Proficient  163.2 (0.0)     29.2    167.2 (1.1)     19.6    163.7 (0.7)     28.8    162.6 (1.3)     32.0
  Advanced    174.5 (0.0)     7.0     177.9 (1.1)     3.7     175.0 (1.9)     6.5     177.2 (0.7)     4.4

Note: Data printed in bold italics were not presented to panelists in the process. 



Table 10 

Cutpoints, Standard Deviations and Percentages of Students Scoring 
At or Above Each Achievement Level for FT2 Writing: 
By Group (n=10 each) and Round of Rating 

              A                       B                       C                       D
              Booklet Classification  Booklet Classification  Reckase Method          Reckase Method
              Consequences            Consequences            Consequences            Consequences
              all Rounds              after Round 2           all Rounds              after Round 3
Level         Cutpoint (SD)   %>=     Cutpoint (SD)   %>=     Cutpoint (SD)   %>=     Cutpoint (SD)   %>=

Round 1
  Basic       128.1 (7.6)     96.9    129.7 (11.1)    96.1    138.5 (14.6)    97.8    123.1 (14.0)    98.7
  Proficient  157.1 (8.4)     44.8    160.6 (6.6)     34.9    179.4 (9.8)     3.6     164.6 (13.2)    25.5
  Advanced    194.5 (11.8)    0.2     187.1 (14.0)    0.8     217.5 (13.1)    0.0     211.3 (13.7)    0.0

Round 2
  Basic       131.6 (8.9)     94.9    131.5 (7.5)     94.9    134.1 (20.1)    92.9    124.8 (11.5)    98.4
  Proficient  156.8 (11.1)    44.8    154.3 (8.6)     52.3    167.5 (16.9)    19.0    166.3 (15.3)    21.4
  Advanced    196.5 (9.7)     0.1     190.0 (8.9)     0.4     202.9 (16.1)    0.0     213.5 (11.3)    0.0

Round 3
  Basic       Not Applicable          Not Applicable          136.2 (11.4)    90.7    122.6 (11.1)    98.8
  Proficient  Not Applicable          Not Applicable          171.5 (11.6)    11.9    165.7 (11.7)    22.3
  Advanced    Not Applicable          Not Applicable          208.6 (18.0)    0.0     215.7 (14.9)    0.0

Final
  Basic       133.8 (4.2)     93.2    131.8 (3.1)     94.7    137.3 (2.8)     89.9    125.4 (2.4)     98.1
  Proficient  157.6 (2.6)     42.8    156.3 (3.3)     47.0    164.9 (7.2)     24.8    154.4 (4.2)     53.4
  Advanced    191.8 (3.9)     0.3     187.0 (3.8)     0.9     198.6 (9.3)     0.0     201.3 (12.1)    0.0

Note: Data printed in bold italics were not presented to panelists in the process. 
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Panelists in the Booklet Classification method seemed to understand the consequences data less 
well than other panelists. They were confused by the results that reported essentially no students 
at or above their cutpoints at the Advanced level. This confusion resulted from the fact that they had 
classified booklets at the Advanced level. They found it hard to understand why the proportion of 
the booklets they classified as borderline Advanced and Advanced were not more nearly reflected by 
the consequences data. 
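
The gap between booklet classifications and consequences data can be illustrated with a small sketch. Consequences data report the percentage of the student population scoring at or above a cutpoint, whereas a set of booklets chosen to span the score range need not resemble that population. All scores, weights, and the cutpoint below are hypothetical and are not taken from the field trial data.

```python
def percent_at_or_above(scores, weights, cutpoint):
    """Weighted percentage of scores that are at or above `cutpoint`."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    above = sum(w for s, w in zip(scores, weights) if s >= cutpoint)
    return 100.0 * above / total


# Hypothetical population distribution: score points and population shares.
national_scores = [120, 135, 150, 165, 180, 195]
national_weights = [10, 25, 35, 20, 9, 1]

# Hypothetical set of 40 booklets selected to cover the whole score range,
# so high-scoring booklets are over-represented relative to the population.
booklet_scores = [120] * 5 + [135] * 7 + [150] * 8 + [165] * 8 + [180] * 6 + [195] * 6
booklet_weights = [1] * len(booklet_scores)

advanced_cut = 190  # hypothetical Advanced cutpoint
national_pct = percent_at_or_above(national_scores, national_weights, advanced_cut)
booklet_pct = percent_at_or_above(booklet_scores, booklet_weights, advanced_cut)
print(national_pct, booklet_pct)
```

In this made-up case, 15% of the booklets fall at or above the cutpoint while only 1% of the population does, which is the kind of mismatch the BC panelists found hard to reconcile.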

The general pattern of change in cutpoints for the final round was to raise the Basic cutpoint 
slightly and lower the Advanced cutpoint considerably. Both groups of BC panelists raised the 
Proficient cutpoint, and both groups of MR panelists lowered the Proficient cutpoint. The MR group 
receiving consequences data for the first time after Round 3 lowered the Proficient cutpoint by 11 
points and increased the percentage of students scoring at or above the cutpoint from 22% to 53%. 

While all groups lowered the Advanced cutpoint in the final round, the MR group that first 
received consequences data after Round 3 lowered their cutpoint most. They lowered their cutpoint 
by 14.4 points. Even so, that score point was still generally well above the range of student 
performance on the 1992 NAEP. 
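
As a check on the arithmetic, the drop in the Advanced cutpoint between each group's last rating round and the final round can be recomputed from the Table 10 values. The sketch below assumes that the last rating round is Round 2 for the BC groups (A and B), which had no Round 3, and Round 3 for the MR groups (C and D); the variable names are ours.

```python
# Advanced cutpoints from Table 10: (last rating round, final round).
advanced_cutpoints = {
    "A": (196.5, 191.8),  # BC, consequences all rounds (Round 2 -> Final)
    "B": (190.0, 187.0),  # BC, consequences after Round 2 (Round 2 -> Final)
    "C": (208.6, 198.6),  # MR, consequences all rounds (Round 3 -> Final)
    "D": (215.7, 201.3),  # MR, consequences after Round 3 (Round 3 -> Final)
}

# Size of the decrease for each group, rounded to one decimal as in the table.
drops = {
    group: round(before - after, 1)
    for group, (before, after) in advanced_cutpoints.items()
}
largest = max(drops, key=drops.get)
print(drops, largest)
```

Under these assumptions every group shows a decrease, and group D shows the largest one, consistent with the 14.4-point figure cited in the text.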

Data in Table 10 above (and in figures in Appendix 2) show the standard deviations for ratings 
across the rounds by method and consequences feedback groups. The standard deviations were 
considerably lower for the BC groups than for the MR groups. Standard deviations for Round 2 were 
generally higher for groups receiving consequences feedback data prior to that round than for groups 
not receiving the data. The Reckase Charts in the MR method can reveal extensive information to 
panelists. Further, the MR method requires panelists to shift from an item-by-item rating method 
to a more holistic method of identifying a score point on the Reckase Chart. Perhaps that accounts 
for the relatively higher variability among raters. The Advanced cutpoints were sharply higher than 
for the other two achievement levels for group C panelists for Round 3 and the final cutpoints. 

Evaluations by Panelists. Vast amounts of evaluation data were collected from the four groups of 
panelists in FT2 for the two subjects.^5 Data tables have not yet been prepared for the writing FT2 
evaluation, however, and only data from the civics FT2 can be reported. (Please refer to tables in 
Appendix 2.) Further, only a few examples of results are presented here, along with a brief 
summary of the general findings. Data such as those reported below in Table 11 were analyzed 
across rounds between consequences groups within rating method group, and between rating 
method groups. We generally expect to find that methods become both clearer and easier to apply 
with each successive round of application, for example. Here, we see that panelists were somewhat 
less clear about the rating methods for multiple-choice items in Round 3 than they had been in 
Round 2. The decline was not consistent across all groups for constructed-response items. The same 
pattern is observed for the ease of applying the method for multiple-choice items for panelists using 
the item maps in group A and for panelists marking their cutpoints on the Reckase Charts in group 
D. 

Evaluations by panelists revealed somewhat less confidence in ratings and less understanding of the 
methods after three rounds among civics FT2 panelists who received consequences data than those 
who did not. Panelists in the MR method group who received consequences data throughout the 
process were less confident in their selection of a cutpoint for each achievement level than panelists 
who did not get the data until later, for example. The panelists receiving consequences data 



5 Complete data for both subjects are/will be available upon request. The data are too extensive to reproduce in this report, but some 
tables are presented in Appendix 2 for civics FT2. 
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Table 11 



Mean Response Score on 5-Point Likert-Type Scale to Selected Questions 
About Rating Methods for Civics FT2 

Statement                                      Group   Round 1   Round 2   Round 3

The method for rating multiple-choice items    A       4.1       4.6       3.5
was conceptually clear.                        B       4.2       4.6       4.4
(5 = Totally Agree; 1 = Totally Disagree)      C       4.2       4.0       4.2
                                               D       4.6       4.4       4.0

The method for rating multiple-choice items    A       4.0       4.5       3.5
was easy to apply.                             B       3.7       4.4       4.3
(5 = Totally Agree; 1 = Totally Disagree)      C       4.0       3.9       4.0
                                               D       3.9       4.3       4.0

The method for rating constructed-response     A       3.2       4.4       3.5
items was conceptually clear.                  B       3.9       4.1       4.2
(5 = Totally Agree; 1 = Totally Disagree)      C       3.6       3.8       3.9
                                               D       4.2       4.0       3.7

The method for rating constructed-response     A       2.9       4.1       3.6
items was easy to apply.                       B       3.6       3.7       4.2
(5 = Totally Agree; 1 = Totally Disagree)      C       3.4       3.8       3.9
                                               D       3.6       3.8       3.7


throughout the process less frequently responded that the Reckase Charts were informative and 
revealing with respect to the consistency of their ratings on various dimensions related to item 
types. Relative to the other panelists in the MR group, panelists in group A found the Reckase 
Charts less helpful and less likely to bring their cutpoints closer to their concept of borderline 
performance for each level than their ratings had been without the data in the Reckase Charts. The 
differences in mean responses were not great, but this pattern was generally observed. 

Similarly, the ME group receiving consequences data throughout the process was somewhat less 
positive in their responses about the process and their ratings than panelists receiving consequences 
data later. Again, the differences in the mean responses for the two groups were not great, but this 
pattern generally held. There was no obvious explanation for how consequences data, per se, could 
affect panelists’ understanding of, ability to use, or confidence in using item maps, for example. These 
differences seemed more attributable to the general “personalities” of the panelists than to the 
effects of consequences data, however. 

These patterns could have been a result of information “overload.” There was a general concern that 
panelists were given more information than they could absorb in such a short amount of time. The 
field trials lasted less than half the time devoted to ALS meetings. 

Reactions to Consequences Data. Recall that statistically, the differences between civics FT2 
cutpoints for the ME groups were not significant and they were for the MR groups. In general, the 
consequences data appeared to have little effect on the panelists in the ME group and to have a 
greater effect on the panelists in the MR group. This observation is based on the number of changes 
recommended by panelists in the two groups in response to the consequences data provided. 



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Panelists using the Reckase method tended to recommend more changes. The MR group receiving 
consequences data throughout the process recommended more changes after Round 2 than after 
Round 1. All, or almost all, panelists at one table in MR group A recommended changes to all three 
levels following their third review of consequences data. Seven panelists in the MR group receiving 
consequences data only after Round 3 recommended 11 changes in cutpoints, but only 7 changes 
were recommended by 3 panelists in the similar group using the ME method. Panelists in the ME 
group receiving consequences throughout the process made no changes after the second round of 
consequences data. 

The results for writing FT2 were less clear. Panelists in the BC method were generally more 
confused by the consequences data than panelists in other groups had been. Perhaps the confusion 
resulted from the fact that they were classifying booklets into categories. They reasoned that “a 
real student wrote each booklet,” and they expected some students in the Advanced level. They were 
reminded several times that their 40 booklets did not reflect the national distribution of student 
performance, and their comments suggested that they tried to keep this in mind. Still, they had 
difficulty reconciling the consequences data with their classifications. When asked to recommend 
final cutpoints, the panelists tended to recommend percentages within levels rather than 
percentages at or above levels or actual cutpoints, as requested on the consequences questionnaires. 

No data tables for writing FT2 have been produced yet to show cutpoint changes for each panelist at 
each round by consequences treatment group. Only the overall analyses of cutpoint differences were 
conducted in the limited time between field trials and pilot studies. Those results seemed sufficient 
to show that the effect of timing and frequency of consequences data was not consistent across the 
methods. 

Recommendations for Methods and Procedures to Implement in Pilot Studies 

Materials from the field trials were presented to TACSS during four different two-day meetings that 
were held prior to the pilot studies. During their July 1998 meeting, TACSS recommended the 
methods and procedures to be used in the pilot studies. These methods were, of course, also to be 
implemented in the ALS meetings unless some evidence was revealed in the pilot studies to cause 
modifications. 

One concern was that panelists in the field trials had been given too much feedback. TACSS urged 
ACT to plan carefully the feedback to be presented to panelists, the sequencing of the feedback, and 
the instructions. Panelists need time to think about the feedback before applying it. 

Panelists in the field trials were not as carefully screened as they will be for the pilots and ALS 
panels. There was also far less time in the 2-day field trials than in the 5-day ALS meetings for 
panelists to absorb the information. Still, there was concern. 

Consequences Data 

The findings regarding the timing of consequences data were neither conclusive nor unexpected. 
TACSS has consistently recommended that panelists be informed about the consequences of their 
judgments. TACSS was not, however, convinced that the information would have a great impact on 
subsequent judgments of panelists. TACSS simply believed that panelists should have the data. 
Recognizing that NAGB has never approved the use of consequences data within the ALS process, 
however, TACSS recommended that the consequences data be provided for the first time after 
Round 3. That gives panelists the opportunity of recommending modifications to the cutpoints after 
seeing the consequences data, and the recommendations of panelists will be used to compute the 
final cutpoints. The final cutpoints will be recommended to NAGB, unless there is a reason not to 
do so. The final cutpoints will also be used in the selection of exemplar items and performances to 
be used in reporting student performance relative to the achievement levels. 

Following this recommendation, cutpoints were to be produced from the third round of item ratings 
without consequences data and from modifications to those Round 3 cutpoints 
made in response to consequences data. TACSS recommended that NAGB be provided with both 
Round 3 and Final cutpoints. 

TACSS also recommended that ACT provide individual-level consequences data to panelists after Round 
3. They reasoned that panelists should have the opportunity of adjusting their own cutpoints, based 
on data about the consequences of those cutpoints. They recommended that rater location charts be 
modified for Round 3 to include data about the proportional distribution of student scores. These 
charts would provide a visual representation of their own cutpoints and help panelists determine 
whether, in what direction, and by how much to modify their cutpoints. TACSS also recommended 
that panelists be provided with data reporting the cutpoints and consequences data for each panelist 
(using codes for identification) in their grade group. These modifications would provide panelists 
with more information to use in deciding on their final cutpoints. 

Rating Method(s) 

TACSS generally found no compelling reason to choose one method over another, based on field trial 
data alone. They were interested in how well panelists seemed to understand the process, and they 
looked for any indications that one method would produce more reasonable, consistent, reliable 
results. They placed a high value on selecting a method for which considerable research had been 
conducted and for which ACT had relatively more experience. This pointed to the choice of the ME 
method, i.e., modified Angoff for dichotomous item ratings and estimation of the average score for 
polytomous items. 
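
The two kinds of ME ratings can be combined as in the following sketch. It assumes the conventional Angoff-style aggregation, in which a panelist's raw-score cutpoint is the sum of the estimated probabilities of a correct response on dichotomous items plus the estimated mean scores on polytomous items. The function name and rating values are hypothetical, and mapping such ratings onto the NAEP-like scale reported in the tables would need item parameters and is not attempted here.

```python
def me_raw_cutscore(dichotomous_ratings, polytomous_ratings):
    """Raw-score cutpoint for one panelist at one achievement level.

    dichotomous_ratings: estimated probabilities (0 to 1) that a borderline
        student answers each multiple-choice item correctly.
    polytomous_ratings: estimated mean scores of borderline students on each
        constructed-response item (in rubric score points).
    """
    for p in dichotomous_ratings:
        if not 0.0 <= p <= 1.0:
            raise ValueError("dichotomous ratings must be probabilities")
    return sum(dichotomous_ratings) + sum(polytomous_ratings)


# Hypothetical ratings for three multiple-choice and two constructed-response items.
mc_ratings = [0.25, 0.5, 0.75]
cr_ratings = [1.5, 2.0]
cutscore = me_raw_cutscore(mc_ratings, cr_ratings)
print(cutscore)
```

A group cutpoint would then be some summary (for example, the mean) of the panelists' individual cutpoints; the report's tables give such group values with their standard deviations.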

TACSS did not, however, recommend the use of the ME method with item maps. Although ACT has 
conducted several different research studies with different mapping criteria, the choice of a response 
probability for mapping items remains an unresolved issue. The response probability (RP) used for 
mapping items determines the actual cutpoint, and the choice is clearly significant. In the absence 
of a policy regarding the RP value to use, TACSS recommended against the use of an item mapping 
procedure. 
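
Why the RP value matters can be seen in a small sketch. Assuming, purely for illustration, a two-parameter logistic item response model with the conventional scaling constant of 1.7 (the report does not specify the model or parameters used for mapping), an item with discrimination a and difficulty b maps to the ability at which the probability of a correct response equals RP. The parameter values below are hypothetical.

```python
import math


def two_pl_probability(theta, a, b, scale=1.7):
    """Probability of a correct response under a 2PL model (an assumption)."""
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))


def rp_location(rp, a, b, scale=1.7):
    """Ability at which the 2PL probability of a correct response equals `rp`."""
    if not 0.0 < rp < 1.0:
        raise ValueError("rp must be strictly between 0 and 1")
    return b + math.log(rp / (1.0 - rp)) / (scale * a)


# Hypothetical item: discrimination 1.2, difficulty 0.3 (on the theta scale).
a, b = 1.2, 0.3
locations = {rp: rp_location(rp, a, b) for rp in (0.50, 0.65, 0.80)}
for rp, theta in locations.items():
    print(f"RP = {rp:.2f}: item maps at theta = {theta:.3f}")
```

Under this model the same item maps to a higher point on the scale as RP increases, so any cutpoint read from an item map shifts with the RP criterion chosen.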

Both ACT Project Staff and TACSS were impressed with the apparent ease with which panelists 
used the Reckase Charts, and they believed that the information available to panelists through the 
Reckase Charts should be incorporated into the process. Yet, there was concern that the MR method 
held the potential for being too “data driven” and that the final cutpoints would be based on chart 
data rather than the standards. 

TACSS reviewed the results of the Booklet Classification method and additional analyses by ACT of 
cutpoints based on borderline booklets versus cutpoints based on all booklets classified at the 
borderline and within the levels. The differences were disturbing. TACSS discussed alternative 
computational methods, but they decided to recommend against the use of the BC method for 
writing. This decision was based on their concerns about the computational procedures for the BC 
method in the NAEP context. It was also based on their concerns regarding the extensive 
production and logistics requirements associated with the BC method. 

The recommendation was to use the Mark Reckase method, but NOT have panelists identify 
cutpoints on the charts for the final round. That is, have panelists use the ME method for rating 
items through three rounds. The Reckase Charts will be provided to panelists prior to Round 2 and 
Round 3 ratings to inform them and to help them decide whether and how to modify ratings. 

They also recommended that the ACT NAEP-Like scale scores be printed on the Reckase Charts 
for each round. There was discussion regarding the possibility of having ratings marked 
electronically on the charts, but ACT Project Staff voiced doubt that there would be enough time to 
perform this task between Rounds 2 and 3, which occur within a few hours of each other. Further, 
some still believed that the panelists would gain a fuller understanding of their ratings if they 
marked them on the charts. 

TACSS also recommended that panelists be instructed to draw lines on each Reckase Chart to 
represent both their own cutpoint and the grade cutpoint for each achievement level. This would 
allow panelists to examine their ratings with respect to these cutpoints. TACSS suggested explicit 
instructions for panelists regarding the interpretation of the data on the charts, relative to 
cutpoint data. 

Summary 

The field trials were conducted to test rating methods and the impact of consequences feedback. 
The field trials provided the opportunity to try out different methods similar to those used 
successfully by others, as well as to try out some new methods. These field trials contribute 
significantly to the advancement of research information regarding some alternative standard 
setting methods. 

ACT had proposed a new method to be tested in the field trials, once it “passed the test” in 
simulation studies. Although successful implementation of the method (or a very similar version) 
had been reported, that method was found to be biased, and ACT stopped tests with the method 
after the first field trials. 

Reservations about the use of item maps were not overcome in the field trial process, and item 
maps were eliminated as a choice. Concerns about computational procedures and about the 
logistic demands of the Booklet Classification method eliminated this method. 

TACSS strongly recommended that the method used for the 1998 ALS process have a solid 
research base. TACSS had found no real reason to change methods and would not recommend 
doing so unless the alternative offered significant potential improvements in the process. 

TACSS recommended a new “combination” method combining the greatest benefits of the new 
Mark Reckase method with the strong research base and extensive experience by ACT associated 
with the Mean Estimation method. The recommendation proved to be a good one, and the 
procedures were implemented successfully to set achievement levels for the 1998 NAEP in Civics 
and in Writing. 
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Civics Field Trial 2 

Descriptive Statistics — Outliers Replaced with Means 



Means and Standard Deviations Across All Groups and Rounds 

Variable      N      Mean     SD
Basic         129    148.5    7.6
Proficient    129    164.2    4.5
Advanced      129    175.8    4.6


Means and Standard Deviations by Rounds Across All Groups 
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Mean 
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Basic 


43 
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Means and Standard Deviations by Grou 


ps Across All Rounds 


Group 


Variable 


N 


Mean 


SD 




Basic 


30 


148.6 


5.6 
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Proficient 


30 
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Basic 


33 
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33 
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33 
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33 
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33 


176.0 
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Mean Basic Cutpoints for Rating Method by Type of Feedback 
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Civics Field Trial 2 

[Figure: Cutpoints Set by Different Groups on Different Rounds. Panels: Group A, Group B, Group C, Group D.] 

Civics Field Trial 2 

[Figure: Standard Deviations of Cutpoints Set by Different Groups on Different Rounds.] 

Civics Field Trial 2 

[Figure: Cutpoints Set by Different Groups on Different Rounds Without the Possible Outliers. Panels: Group A, Group B, Group C, Group D.] 

Civics Field Trial 2 

[Figure: Standard Deviations of Cutpoints Set by Different Groups on Different Rounds.] 
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Writing Field Trial 2 

Descriptive Statistics — Outliers Replaced with Means 



Means and Standard Deviations Across All Groups and Rounds 

Variable      N     Mean     SD
Basic         40    129.8    13.1
Proficient    40    161.5    14.2
Advanced      40    200.8    14.3


Means and Standard Deviations by Rounds Across All Groups 



Round 


Variable 


N 


Mean 


SD 




Basic 


40 


129.4 


13.2 
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Proficient 


40 


165.5 


12.8 




Advanced 


40 


203.8 


18.7 
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40 
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13.1 
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40 


161.5 


14.2 
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40 


200.8 


14.3 


Means and Standard Deviations by Grou 


ps Across All Rounds 
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Mean 


SD 




Basic 
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20 


212.4 
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20 


122.5 


12.5 


D 


Proficient 


20 


165.9 


13.9 




Advanced 


20 


212.8 


12.2 




Field Trial 2 for Writing 

[Figure: Cutpoints Set by Different Groups on Different Rounds. Panels: Group A, Group B, Group C, Group D.] 

Field Trial 2 for Writing 

[Figure: Standard Deviations of Cutpoints Set by Different Groups on Different Rounds. Panels: Group A, Group B, Group C.] 
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1998 NAEP Achievement Levels-Setting Process 
Field Trial 2 for Civics 

Summary of Responses to Process Evaluation Questions 
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Means and FVequencies of Responses to Questions Related to the Item Mapping Method 
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Means and Frequencies of Responses to Questions Related to the Reckase Method (Round 2) 
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Means and Frequencies of Responses to Questions Related to the Reckase Method (Round 3) 
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